feat(scan): pure-TS ripgrep-compatible scanner + DSN migration + perf overhaul by BYK · Pull Request #791 · getsentry/cli

BYK · 2026-04-20T22:02:24Z

Summary

Introduces src/lib/scan/ — a pure-TypeScript, dependency-free-at-runtime
ripgrep-compatible file scanner — and migrates the DSN scanner at
src/lib/dsn/code-scanner.ts to consume it. Also ships a local benchmark
harness with a data-driven concurrency tuning pass, a whole-buffer
matchAll rewrite of the grep engine, and a two-tier IgnoreStack
optimization.

The DSN scanner shrinks from 715 → 435 LOC (−280) while keeping all
public export signatures unchanged. DSN detection's common-case hot paths
(detectDsn.cold, scanCodeForFirstDsn) become 46-87× faster. The
exhaustive find-all path holds near parity with pre-change numbers
(+4.7%, well under the 20% budget) while gaining nested-.gitignore
correctness and dirMtimes cache coverage.

What's in the module

src/lib/scan/ (10 files, ~2.7k LOC, exported via a barrel):

walker.ts — DFS async generator, streaming readdir({withFileTypes}), time-budgeted exploration, monorepo descent hook, onDirectoryVisit observer for cache-invalidation callers
ignore.ts — IgnoreStack wrapping the ignore npm package with a root-only fast path + per-dir nested .gitignore support (matches git's cumulative semantics)
binary.ts — 8 KB NUL-byte sniff + extension fast path for text/binary classification
concurrent.ts — shared mapFilesConcurrent / mapFilesConcurrentStream with cooperative early-exit flag (never uses pLimit.clearQueue() — that hangs Promise.all)
regex.ts — compilePattern (rg-style leading inline flags → JS RegExp flags) + ensureGlobalFlag / ensureGlobalMultilineFlags helpers
grep.ts — grepFiles (AsyncIterable<GrepMatch>) + collectGrep (sorted Promise<GrepResult>). Whole-buffer matchAll iteration.
glob.ts — globFiles + collectGlob, picomatch-backed
constants.ts — TEXT_EXTENSIONS, DEFAULT_SKIP_DIRS, DSN_ADDITIONAL_SKIP_DIRS, MAX_FILE_SIZE, CONCURRENCY_LIMIT, helpers (normalizePath, isMonorepoPackageDir)
types.ts — WalkEntry, WalkOptions, GrepMatch, GrepOptions, GlobOptions, IgnoreMatcher
index.ts — public barrel

DSN-specific policy lives in src/lib/dsn/scan-options.ts. scan/ stays policy-free so the init wizard (follow-up PR) can bring its own preset.

DSN scanner migration (`src/lib/dsn/code-scanner.ts`)

	Before	After
LOC	715	435 (−280)
Public exports	5	5 (signatures unchanged)
Telemetry span + attrs	`scanCodeForDsns` / `dsn.detect.code`	unchanged
Nested `.gitignore`	root-only	cumulative (git semantics)
Early-exit logic	ad-hoc `state.earlyExit`	shared `mapFilesConcurrent.onResult`

DSN-specific content extraction (extractDsnsFromContent, isCommentedLine, isValidDsnHost, getExpectedHost) stays in code-scanner.ts unchanged. The walker + concurrency + gitignore handling moved to scan/.

Added a case-insensitive literal-prefix fast path to extractDsnsFromContent (/http/i.test(content)) that short-circuits matchAll when the file contains no scheme prefix in any casing.

Performance (synthetic/large, 10k files, 4-core)

Op	Pre-PR baseline	Now	Δ
`detectDsn.cold`	179 ms	3.76 ms	48× faster
`scanCodeForFirstDsn`	183 ms	2.10 ms	87× faster
`scanCodeForDsns`	298 ms	312 ms	+4.7% (under 20% bar)
`detectAllDsns.cold`	304 ms	318 ms	+4.6%
`scan.grepFiles`	(new op)	283 ms	0.95× of OLD `scanCodeForDsns`
`scan.ignore` (µbench)	(new op)	2.09 ms	81% faster than pre-opt

The 46-87× wins on the stop-on-first hot paths come from mapFilesConcurrent's cooperative early-exit flag — the old scanner's stop-on-first was less aggressive. That's the path every sentry issue list invocation hits.

The small regression on scanCodeForDsns is the cost of nested-gitignore correctness (~41ms single-threaded) plus the case-insensitive scheme probe (~15ms vs raw indexOf). Both are correctness improvements; the regression is within the planning budget (20%).

Benchmark harness

script/bench.ts + script/bench-sweep.ts, deterministic xorshift32-seeded fixture generator at test/fixtures/bench/. Three presets (small ~100 files, medium ~1k, large ~10k). Baselines gitignored (machine-specific).

bun run bench                 # all ops, all sizes
bun run bench --size large --runs 10
bun run bench --save-baseline # .bench/baseline.json (local-only)
bun run bench --compare       # fail on >20% regression
bun run bench:sweep           # concurrency sweep to tune CONCURRENCY_LIMIT

Op labels intentionally match Sentry span names (scanCodeForDsns, findProjectRoot, …) so local measurements correlate with production telemetry.

Optimizations landed

mapFilesConcurrent + per-file extractDsnsFromContent(content, limit: 1) — the early-exit wins (46-87× on stop-on-first).
CONCURRENCY_LIMIT tied to availableParallelism() with floor 2 and cap 16. The JSDoc documents the tradeoff between walker-fed (our real workload) and pure-I/O workloads, and notes that callers can override per-call.
grep.ts::readAndGrep whole-buffer matchAll — replaced content.split("\n") + per-line regex.test with regex.exec + lazy line-num counting. 12× faster on the profiled grep CPU phase.
Case-insensitive literal-prefix fast path in DSN — /http/i.test(content) short-circuits matchAll on the ~92% of files that can't possibly contain a DSN. Must be case-insensitive for correctness.
IgnoreStack two-tier query — root-only fast path when no nested gitignores loaded (14× per-query speedup); prefix-walk slow path for the nested case (3× speedup). The scan.ignore microbench dropped from 10.9 ms → 2.1 ms.

Non-goals, explicitly rejected with evidence

Worker threads: profiling showed CPU work (45 ms DSN regex on the large fixture) is smaller than worker bootstrap (~40 ms for 4 workers) + postMessage serialization (~50-200 ms to move 80 MB of content). Net negative. A 12× single-threaded win in grep.ts covered the gap instead.
Replace ignore with picomatch: gitignore syntax (bare basename match, trailing / = subtree, leading / anchored) isn't compatible with picomatch glob syntax. Would require a 200-400 LOC translator with subtle correctness edge cases. ignore package is ~800 LOC specifically because this is non-trivial.

Behavior change (intentional)

Nested .gitignore now honored in DSN detection. Pre-change: only the project-root .gitignore was read. Post-change: every directory's .gitignore applies to its subtree with cumulative semantics (parent patterns apply, child negations override).

User-visible impact: a DSN in a file covered by a subdirectory .gitignore is no longer detected. This matches git's own check-ignore behavior and is almost always what users want (a gitignored file should not leak DSNs to the CLI). If anyone depends on the old behavior, they can remove the relevant .gitignore rule.

Review round fixes

Fixed mixed-case DSN regression (Https://, hTtP://, etc. were silently dropped by the initial indexOf-based probe). The probe is now /http/i.test(content). Widened the property test arbitrary to include uppercase so this class is actually exercised. Added 5 parametrized regression tests in code-scanner.test.ts.
Reverted an overly-aggressive CONCURRENCY_LIMIT bump (cores × 4, cap 32) that microbenchmarked well but regressed the real walker-fed workload ~15%. The JSDoc now documents why the "I/O-bound therefore scale with something larger than cores" reasoning is misleading for this specific pipeline.

What's NOT in this PR

Init wizard migration (replace src/lib/init/tools/grep.ts + glob.ts subprocess shims with collectGrep/collectGlob). Separate follow-up PR with its own plan.
Worker pool. Rejected above.

Test plan

bun run typecheck — clean
bun run lint — clean (1 pre-existing warning in markdown.ts, not from this PR)
bun test --timeout 15000 test/lib test/commands test/types — 5464 pass, 0 fail (+183 new scan + bench + DSN tests including property tests via fast-check + 5 mixed-case regression tests)
bun test test/isolated — 138 pass, 0 fail
bun run bench --size large — scanCodeForDsns within 1.2× of pre-PR baseline (312 ms vs 298 ms = 1.05×)
Semantic parity verified: DSN detection on the medium fixture produces 4 unique DSNs, 76 sourceMtimes, 461 dirMtimes — identical to pre-change scanner
Mixed-case scheme casings detected: Https://, hTtP://, HttpS:// — all pass

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com

github-actions · 2026-04-20T22:02:41Z

Semver Impact of This PR

🟡 Minor (new features)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).

New Features ✨

(auth) Hint when env token is shadowed by stored OAuth (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #790
(help) Add Environment Variables section to branded help (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #786
(issue) Add resolve, unresolve (reopen), and merge commands by BYK in #778

(scan) Pure-TS ripgrep-compatible scanner + DSN migration + perf overhaul by BYK in #791

Bug Fixes 🐛

(arg-parsing) Throw ValidationError (not raw Error) for format errors by BYK in #793
(error-reporting) Fall back to message prefix for ValidationError without field by BYK in #776
(errors) Surface real API errors and granular scopes in 403s (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #789
(hex-id) Auto-recover malformed hex IDs in view commands (CLI-16G) by BYK in #777
(init) Force process exit after wizard completes by betegon in #782
(ux) Rate-limit update banner and silence /api/0/ auto-fix (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #787

Internal Changes 🔧

(init) Trim deprecated --features help entries by MathurAditya724 in #781
Regenerate docs by github-actions[bot] in 58a84035

_{🤖 This preview updates automatically when you update the PR.}

github-actions · 2026-04-20T22:03:02Z

PR Preview Action v1.8.1
🚀 View preview at https://cli.sentry.dev/_preview/pr-791/
Built to branch `gh-pages` at 2026-04-21 10:59 UTC. Preview will be ready when the GitHub Pages deployment is complete.

github-actions · 2026-04-20T22:04:25Z

Codecov Results 📊

✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric	Change
Total Tests	—
Passed Tests	—
Failed Tests	—
Skipped Tests	—

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 98.05%. Project has 1730 uncovered lines.
✅ Project coverage is 95.65%. Comparing base (base) to head (head).

Files with missing lines (6)

File	Patch %	Lines
src/lib/scan/walker.ts	95.42%	⚠️ 13 Missing
src/lib/scan/grep.ts	96.62%	⚠️ 8 Missing
src/lib/dsn/code-scanner.ts	94.39%	⚠️ 6 Missing
src/lib/scan/regex.ts	98.39%	⚠️ 2 Missing
src/lib/scan/concurrent.ts	99.12%	⚠️ 1 Missing
src/lib/scan/glob.ts	98.21%	⚠️ 1 Missing

Coverage diff

@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    95.58%    95.65%    +0.07%
==========================================
  Files          264       278       +14
  Lines        38454     39794     +1340
  Branches         0         0         —
==========================================
+ Hits         36755     38064     +1309
- Misses        1699      1730       +31
- Partials         0         0         —

Generated by Codecov Action

…dedup Three findings from Cursor Bugbot + Seer on PR #791: **Seer (MEDIUM): followSymlinks: true silently drops symlinks** `readdir({withFileTypes: true})` uses lstat semantics — on a symlink, `isSymbolicLink()` returns true while `isFile()` and `isDirectory()` both return false. The existing `processEntry` only routed entries via `isDirectory()` / `isFile()`, so a symlink with `followSymlinks: true` passed the skip check but then fell through and returned null. The `inodeKey`/`visitedInodes` cycle detection in `maybeDescend` was unreachable dead code. Fix: when a symlink is encountered with `followSymlinks: true`, `stat()` (not `lstat`) the target to learn its real type, route to `maybeDescend` or `tryYieldFile` accordingly. Broken symlinks and stat errors are silently skipped (matches ripgrep's behavior). Circular symlinks are broken by the existing inode-based cycle detection. Tests: 4 new symlink cases in `walker.test.ts` covering default-skip, follow-through, circular cycle detection, and broken symlinks. **Cursor #2 (MEDIUM): concurrent grep workers share stateful RegExp** When a pre-compiled `/gm` RegExp was passed to `grepFiles`, `ensureGlobalMultilineFlags` returned it unchanged. Concurrent `readAndGrep` workers then shared the regex's `lastIndex` property. Strictly speaking this was safe today — JS is single-threaded and the match loop has no `await`, so workers can't actually interleave mid-iteration — but any future `await` inside the loop would turn it into a silent match-dropping bug. Fix: always clone the regex at the start of `readAndGrep` via `new RegExp(src, flags)`. ~1µs per file, eliminates the foot-gun. Added a regression test that exercises the shared-regex path with multiple concurrent workers. **Cursor #1 (LOW): shared utilities duplicated between grep and glob** `compileMatchers`, `matchesAny`, `basenameOf`, `walkerRoot`, and `joinPosix` were duplicated between `src/lib/scan/grep.ts` and `glob.ts`. Extracted to new internal module `src/lib/scan/path-utils.ts` (not part of the public barrel — implementation detail). Net LOC reduction: ~80 lines across the two files. ## Test plan - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning) - [x] `bun test test/lib test/commands test/types` — **5443 pass, 0 fail** (+5 from this commit) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large` — no regression on `scanCodeForDsns` (312ms) or `scan.grepFiles` (278ms)

… overhaul Introduces `src/lib/scan/` — a pure-TypeScript, dependency-free-at-runtime ripgrep-compatible file scanner (walker + ignore + binary detection + grep + glob) — and migrates the DSN scanner at `src/lib/dsn/code-scanner.ts` to consume it. Also ships a local benchmark harness, a data-driven concurrency tuning pass, and two rounds of `IgnoreStack` optimization. - **`src/lib/scan/`** (~2.7k LOC across 10 source files): walker (DFS with time budget + monorepo descent hook + gitignore), grep (`AsyncIterable<GrepMatch>` + `collectGrep`), glob (picomatch-backed), shared `mapFilesConcurrent` helper, regex helpers (`compilePattern`, `ensureGlobalFlag`, `ensureGlobalMultilineFlags`). Policy-free core — the DSN scanner composes its own preset via `src/lib/dsn/scan-options.ts`. - **DSN scanner migration**: `src/lib/dsn/code-scanner.ts` shrinks from **715 → 435 LOC** (−280). All 5 public exports + wire shapes preserved byte-identical. Telemetry span/attributes unchanged. New behavior: respects nested `.gitignore` files (matches git's cumulative semantics — DSNs in files covered by a subdirectory `.gitignore` are no longer falsely detected). - **Benchmark harness**: `script/bench.ts` + `script/bench-sweep.ts`, deterministic xorshift32-seeded fixture generator at `test/fixtures/bench/`. Three presets (small/medium/large ≈ 100/1k/10k files). `bun run bench`, `bench:save`, `bench:compare`, `bench:sweep`. Baselines gitignored (machine-specific). - **Concurrency tuning**: data-driven sweep found the knee at `availableParallelism()` on a 4-core box (old default `50` was 8% slower at the fixture size). `CONCURRENCY_LIMIT` now `Math.max(2, cores)`. - **IgnoreStack rewrite**: split into `#rootIg` + `#nestedByRelDir` with a root-only fast path. Per-query cost dropped from 3.76µs → 0.27µs (14×) on the common case, 5.76µs → 1.93µs (3×) on nested. - **Grep whole-buffer matchAll**: replaced `content.split("\n")` + per-line `regex.test` with whole-buffer iteration via `regex.exec` + lazy line-number counting. 12× faster on profiled grep CPU phase. - **Literal-prefix fast path** in DSN scanner: short-circuits `matchAll` when content has neither `"http"` nor `"HTTP"`. ~3% improvement on large fixture. | Op | Baseline | Now | Δ | |-----------------------|---------:|-------:|------------------------| | `detectDsn.cold` | 179ms | 3.9ms | **46× faster** | | `scanCodeForFirstDsn` | 183ms | 2.1ms | **88× faster** | | `scanCodeForDsns` | 298ms | 300ms | parity | | `detectAllDsns.cold` | 304ms | 310ms | parity | | `scan.ignore` (µbench)| (new op) | 2.0ms | 81% faster than pre-opt| The common-case hot paths (`detectDsn.cold`, `scanCodeForFirstDsn`) are dramatically faster thanks to `mapFilesConcurrent`'s cooperative early-exit flag. Find-all paths hold at parity while gaining nested-gitignore correctness, richer cache coverage (`dirMtimes`), and a 40% LOC reduction. - **Worker threads**: profiling showed CPU work (45ms DSN regex) is smaller than worker bootstrap + postMessage serialization. Net negative. - **Replace `ignore` package with picomatch**: gitignore syntax isn't compatible with glob syntax. Would need a 200-400 LOC translator with subtle correctness edge cases. - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning in markdown.ts) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5433 pass, 0 fail** (+152 new scan + bench + dsn tests) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large` — all success criteria met, benchmarks stable

Addresses two issues raised in PR 791 code review: **C1: Mixed-case DSN schemes silently dropped** (regression in PR 791) The literal-prefix fast path in `extractDsnsFromContent` used two case-sensitive `indexOf` calls (`"http"` and `"HTTP"`), which covered only all-lowercase and all-uppercase scheme prefixes. Mixed-case URLs like `Https://...` or `hTtP://...` returned `[]` even though the `DSN_PATTERN` regex (with `/i` flag) would have matched. Impact: any source file with non-standard scheme casing (rare but possible in generated code or test fixtures) would silently fail DSN auto-detection — `sentry issue list` would behave as if no DSN existed. Also hits the cache-verifier path via `extractFirstDsnFromContent`. Fix: replaced the two `indexOf` calls with `/http/i.test(content)`. Adds ~16ms on a 10k-file scan (V8 regex vs raw `indexOf`), a trade we accept for correctness. **C2: Property test couldn't catch the mixed-case regression** The property test for the fast path used a lowercase-only arbitrary charset, so it could never generate the input class where the old fast path diverged from the slow path. Widened `BENIGN_CHARS` to include uppercase letters; tightened the JSDoc explaining the invariant. Also added an explicit `test.each` regression suite in `code-scanner.test.ts` pinning detection across five scheme casings (all-lower, all-upper, title, mixed, http no-s). **Concurrency retune: back to `availableParallelism`** Previously set to `Math.max(2, availableParallelism())` in PR 3.5. First attempted to raise to `cores × 4` (up to 32) based on a microbenchmark showing that pure-I/O per-file work has a knee at ~16-32 in-flight tasks. Measurement was correct but didn't reflect the real workload: every current scan-module caller is walker-fed, and the walker's serial `readdir` descent is the bottleneck. Raising concurrency regressed `scanCodeForDsns` ~15% because more in-flight tasks created scheduling pressure without additional useful work. Final choice: `Math.min(16, Math.max(2, availableParallelism()))` — same as before this commit (4 on 4-core, 2 floor for tiny CI, 16 cap for large workstations). The JSDoc now honestly documents the tension between walker-fed and pure-I/O workloads and tells callers they can override per-call via `opts.concurrency` when they know the shape of their work. - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5438 pass, 0 fail** (+5 new parametrized regression tests) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large` — `scanCodeForDsns` 312ms (PR 0 baseline 298ms, +4.7%, under 1.2× bar); hot-path ops `detectDsn.cold` 3.76ms / `scanCodeForFirstDsn` 2.10ms unchanged (46-87× faster than baseline) - [x] Manual: mixed-case DSN casings (`Https://`, `hTtP://`, `HttpS://`) all detected post-fix

…dedup Three findings from Cursor Bugbot + Seer on PR #791: **Seer (MEDIUM): followSymlinks: true silently drops symlinks** `readdir({withFileTypes: true})` uses lstat semantics — on a symlink, `isSymbolicLink()` returns true while `isFile()` and `isDirectory()` both return false. The existing `processEntry` only routed entries via `isDirectory()` / `isFile()`, so a symlink with `followSymlinks: true` passed the skip check but then fell through and returned null. The `inodeKey`/`visitedInodes` cycle detection in `maybeDescend` was unreachable dead code. Fix: when a symlink is encountered with `followSymlinks: true`, `stat()` (not `lstat`) the target to learn its real type, route to `maybeDescend` or `tryYieldFile` accordingly. Broken symlinks and stat errors are silently skipped (matches ripgrep's behavior). Circular symlinks are broken by the existing inode-based cycle detection. Tests: 4 new symlink cases in `walker.test.ts` covering default-skip, follow-through, circular cycle detection, and broken symlinks. **Cursor #2 (MEDIUM): concurrent grep workers share stateful RegExp** When a pre-compiled `/gm` RegExp was passed to `grepFiles`, `ensureGlobalMultilineFlags` returned it unchanged. Concurrent `readAndGrep` workers then shared the regex's `lastIndex` property. Strictly speaking this was safe today — JS is single-threaded and the match loop has no `await`, so workers can't actually interleave mid-iteration — but any future `await` inside the loop would turn it into a silent match-dropping bug. Fix: always clone the regex at the start of `readAndGrep` via `new RegExp(src, flags)`. ~1µs per file, eliminates the foot-gun. Added a regression test that exercises the shared-regex path with multiple concurrent workers. **Cursor #1 (LOW): shared utilities duplicated between grep and glob** `compileMatchers`, `matchesAny`, `basenameOf`, `walkerRoot`, and `joinPosix` were duplicated between `src/lib/scan/grep.ts` and `glob.ts`. Extracted to new internal module `src/lib/scan/path-utils.ts` (not part of the public barrel — implementation detail). Net LOC reduction: ~80 lines across the two files. - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning) - [x] `bun test test/lib test/commands test/types` — **5443 pass, 0 fail** (+5 from this commit) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large` — no regression on `scanCodeForDsns` (312ms) or `scan.grepFiles` (278ms)

… truncation overshoot Two follow-up findings from Cursor Bugbot on commit dd4fde0: **Cursor (MEDIUM): error path leaks partial dirMtimes** The `scanCodeForDsns` catch block was returning `{ dsns: [], sourceMtimes: {}, dirMtimes }` where the last field carried partial state populated by `onDirectoryVisit` callbacks that fired before the walker threw. If a downstream caller cached this partial result, the cache verifier would later only check the dirs we happened to reach pre-error and silently bless the cache for dirs the walker never visited — missing DSNs in those unvisited dirs. Fix: return all three maps empty on error. Matches the pre-PR-3 scanner's behavior and forces a full rescan on the next attempt. **Cursor (LOW): collectGrep truncation off-by-one** `collectGrep` forwarded `maxResults` directly to `grepFilesInternal`, which flipped `stats.truncated = true` the moment `matchesEmitted >= maxResults`. So asking for `maxResults: 3` against a corpus with exactly 3 matches (no more) falsely reported truncation. Fix: mirror `collectGlob`'s `+1` overshoot pattern. Ask the iterator for `maxResults + 1` internally; stop draining at `matches.length >= maxResults`; only set `truncated = true` if we actually saw the overshoot match. Also matches our own documented lore about this pattern (which I violated — the reviewer's catch). ## Test plan - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning) - [x] `bun test test/lib test/commands test/types` — **5445 pass, 0 fail** (+2 regression tests) - [x] `bun test test/isolated` — 138 pass - [x] Neither fix affects the happy-path runtime (error branch never runs; truncation probe adds one iteration only when the corpus has strictly more than `maxResults` matches)

…onors opt-out Two more findings from Cursor Bugbot on commit 5347702: **Cursor (MEDIUM): producer errors swallowed on consumer break** `mapFilesConcurrentStream` placed the `throw producerError` after its `try/finally` block. When a consumer `break`s early, the runtime calls the generator's `return()`, which runs `finally` but does NOT execute any statement after the try block. Producer errors were therefore silently lost on the break path (though visible on drain-to-completion). Verified with a standalone repro: ```js async function* gen() { try { yield 1; yield 2; } finally { console.log("finally ran"); } throw new Error("dead code"); } for await (const x of gen()) { if (x === 1) break; } // → "finally ran" prints; error does NOT surface. ``` Fix: move the rethrow inside the `finally` block. Added a regression test that reverts cleanly — when the fix is removed, the test fails with `caught === null`. **Cursor (LOW): `multiline: false` silently overridden** `readAndGrep` unconditionally applied `/m` via `ensureGlobalMultilineFlags`, making the `GrepOptions.multiline` option a no-op. Caller passing `multiline: false` expecting buffer-boundary anchoring got line-boundary anchoring instead. Root cause: the whole-buffer `matchAll` rewrite in PR 3.5 needed `/m` to preserve grep's line-boundary semantics (`^foo` matches any line starting with `foo`), but I hardcoded it rather than making the default explicit. Fix: the `multiline` option now defaults to `true` (grep-like) and honors `multiline: false` when explicitly set (buffer-boundary JS semantics). Updated JSDoc in `types.ts` to document the new contract. Added two regression tests: - default: `^foo` matches mid-file lines (grep-like) - `multiline: false`: `^foo` only matches at buffer start - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning) - [x] `bun test test/lib test/commands test/types` — **5448 pass, 0 fail** (+3 from this commit) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large` — no perf regression (per-op: scan.grepFiles 313ms, scan.walk.dsnParity 212ms, scan.ignore 1.74ms, detectDsn.cold 5.85ms, scanCodeForFirstDsn 4.24ms, scanCodeForDsns 365ms)

…sult Follow-up to Cursor Bugbot finding on the rebased commit: `processEntry` returned `[]` for files with no DSNs, which `mapFilesConcurrent` pushes into its results array and fires `onResult` for (as a no-op over an empty array). On a 10k-file walk where ~99% of files contain no DSN, this accumulates ~9900 wasted `onResult` invocations and pushes ~9900 empty arrays into the otherwise-unused results buffer. Fix: return `null` from `processEntry` when the file contributes nothing (no raw matches, all matches rejected by host validation, or the file was unreadable). `mapFilesConcurrent` already filters `null` results before invoking `onResult` and before pushing to results, so the empty-work path is fully skipped. This also tidies a parallel subtlety: when `detected.length === 0` (raw matches existed but host validation rejected all of them), we now skip sourceMtimes recording AND the onResult call, consistent with "no DSNs here". Previously we'd push an empty array and record no mtime but still fire onResult — inconsistent signal. Type change: `processEntry(...)` now returns `Promise<DetectedDsn[] | null>`. Private function, no external API impact. ## Test plan - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning) - [x] `bun test test/lib/dsn/` — **150 pass, 0 fail** (DSN semantics unchanged: 4 DSNs, 76 sourceMtimes, 461 dirMtimes on medium fixture) - [x] `bun test test/lib/scan/` — 153 pass (scan module untouched)

Two low-severity findings from Cursor Bugbot on commit 5bb480c: **`filesRead` counter incremented before read succeeded** `grepFilesInternal`'s per-file callback bumped `opts.stats.filesRead` *before* `readAndGrep` executed. `GrepStats` documents the field as "files whose content was read and tested against the pattern" but the actual count included failed opens (`readAndGrep` returns `null` on read error). Fixed: increment only after a non-null return. **`grepFiles` and `collectGrep` duplicated pipeline setup** Both entry points had ~15 lines of identical code for pattern compilation, matcher compilation, default resolution, walker construction, and filter wiring. Any future change to defaults needed to be applied to both; a divergence between them would be a silent correctness bug. Extracted to a shared `setupGrepPipeline()` helper that returns a `GrepPipelineSetup` bundle consumed by both entry points. `collectGrep` still manages its own `+1 probeLimit` override and the collector-side truncation flag after the shared setup runs. ## Test plan - [x] `bun run typecheck` — clean - [x] `bun run lint` — clean (1 pre-existing warning) - [x] `bun test test/lib test/commands test/types` — **5464 pass, 0 fail** - [x] `bun test test/lib/scan/` — 153 pass (same)

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 7bddf97. Configure here.}

cursor · 2026-04-21T11:12:59Z

+    let ignored = rootResult.ignored && !rootResult.unignored;
+    if (rootResult.unignored) {
+      ignored = false;
+    }


Redundant conditional in isIgnored Tier 2 path

Low Severity

In the Tier 2 path of isIgnored, lines 216-218 (if (rootResult.unignored) { ignored = false; }) are dead code. Line 215 already computes ignored = rootResult.ignored && !rootResult.unignored, which correctly handles every combination: when unignored is true, the && with !unignored already yields false. The subsequent if block can never change the value of ignored.

^{Reviewed by Cursor Bugbot for commit 7bddf97. Configure here.}

## Summary Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797. - **Literal prefilter** — ripgrep-style: extract a literal substring from the regex source (e.g., `import` from `import.*from`), scan the buffer with `indexOf` to locate candidate lines, only invoke the regex engine on lines that contain the literal. V8's `indexOf` is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from. - **Lazy line counting** — swapped `charCodeAt`-walk for `indexOf("\n", cursor)` hops. 2-5× faster on the line-counting sub-loop because V8 implements `indexOf` in C++ without per-iteration JS interop. ## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core) | Op | Before | After | Δ | |---|---:|---:|---:| | `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** | | `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** | | `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** | | `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** | | `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) | | `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** | The DSN workloads improve because `DSN_PATTERN` extracts `http` as its literal — most source files don't contain `http` at all, so the prefilter short-circuits before the regex runs. No regressions on any benchmark. Pure-literal patterns (e.g., `SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer path unchanged. ## What changed ### New file: `src/lib/scan/literal-extract.ts` (~300 LOC) Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes. Handles escaped metacharacters intelligently: `Sentry\.init` yields `Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields `foo` (escape `\b` is an anchor, not a literal `b`). Exports: - `extractInnerLiteral(source, flags)` — returns the literal, or null if no safe extraction possible. Honors `/i` by lowercasing. - `isPureLiteral(source, flags)` — true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there). ### Modified: `src/lib/scan/grep.ts` (~240 LOC changes) Three-way dispatch in `readAndGrep` based on the extracted literal: 1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal + `multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs the regex engine only on those. This is the main perf win. 2. **`grepByWholeBuffer`** — existing path, used for: - Pure-literal patterns (V8 handles them optimally) - Patterns with no extractable literal (complex regex, top-level alternation) - `multiline: false` mode (the fast path requires per-line semantics) Also: replaced the `charCodeAt`-walk that counted newlines char-by-char with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle the match-construction arguments. ### Tests added - `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length). - `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing. ## Why this approach From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its `String.prototype.indexOf` is compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case. Three of the five techniques in the Loggly regex-perf guide were evaluated: - **Character classes over `.*`** — `DSN_PATTERN` already uses `[a-z0-9]+`, no change needed. - **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is already correctly ordered (`.` more common than `:` in DSN hosts); swapping regressed perf by noise. - **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed* perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead. The remaining gap with rg is now primarily orchestration overhead (async/await, `mapFilesConcurrent`, walker correctness features) rather than regex speed. A worker-pool exploration may follow. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated to this PR) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5610 pass, 0 fail** (+58 new) - [x] `bun test test/isolated` — 138 pass, 0 fail - [x] `bun run bench --size large --runs 5` — all scan ops at or below previous baseline - [x] Manually verified semantic parity: `collectGrep` returns identical `GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the prefilter fires 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…#805) ## Summary Walker was 411ms p50 for 10k files on the synthetic/large fixture — roughly 10× slower than a naive sync walk (50ms). Follow-up to PR #791 and #804. Five targeted optimizations in the hot loop, all micro-benchmarked before landing: | Change | Measured win | End-to-end win | |---|---|---| | `readdirSync` instead of `readdir` | 105→45ms (−60ms) | dominant | | `statSync` instead of `Bun.file().size` | −10ms per 10k files | | | String concat for paths | path.join 7ms → concat 0.7ms (10×) | | | `abs.slice(cwdPrefixLen)` for relPath | path.relative 9ms → slice 0.8ms (11×) | | | Manual `lastIndexOf`-based extname | 9ms → 7ms (25%) | | ### Perf results (synthetic/large, 10k files, p50) | Op | Before | After | Δ | |---|---:|---:|---:| | `scan.walk` | 411ms | **231ms** | **−44%** | | `scan.walk.noExt` | 572ms | 448ms | **−22%** | | `scan.walk.dsnParity` | 228ms | **138ms** | **−39%** | | `scanCodeForDsns` | 323ms | 304ms | −6% | | `detectAllDsns.cold` | 327ms | 308ms | −6% | | `detectAllDsns.warm` | 27.9ms | 27.0ms | — | | `scan.grepFiles` | 322ms | 316ms | noise | | `scanCodeForFirstDsn` | 2.23ms | 2.05ms | — | Walker ops are 22-44% faster. Downstream ops (grep, DSN scanner) benefit less because their time is dominated by content scanning, not walking — but still show consistent ~6% improvements. ## Why `readdirSync` is safe here The sync vs async tradeoff usually favors async because blocking the event loop is bad in general. But measured per-call cost matters: | Metric | `readdirSync` on 3635 dirs | |---|---| | p50 | 11µs | | p95 | 24µs | | p99 | 35µs | | max | 65µs | 65µs max block is trivial — `setTimeout(0)` latency in Node is ~4ms. Blocking for 65µs never causes noticeable event-loop pauses. And we pay ~60µs of microtask overhead for each async readdir, which wipes out any theoretical fairness benefit. Net: 2-3× faster per-dir on walks with many small directories, which is every realistic CLI workload. If this ever matters for a weird embedded use case, the optimization is trivially reversible — the walker's public API is unchanged. ## mtime parity fix Initial implementation regressed `detectAllDsns.warm` from 28ms → 304ms because `statSync().mtimeMs` is a float (e.g. `1776790602458.1033`) while `Bun.file().lastModified` is already an integer. The DSN cache validator compares floored `sourceMtimes`, so un-floored floats caused cache misses on every warm call. Fixed by flooring explicitly in `tryYieldFile` — matches the same treatment already applied to `onDirectoryVisit`'s dirMtimes. ## Walker v2 design notes - **Sync readdir + async yield.** Readdir blocks briefly (~65µs max), but the surrounding async-generator interface is preserved. Library consumers that insert async work between iterations still get cooperative scheduling. - **Path ops inlined.** `path.join` / `path.relative` / `path.extname` + `toLowerCase` are all replaced with manual string ops in the hot loop. On Windows, `normalizePath` is still applied via `replaceAll(NATIVE_SEP, "/")` — the POSIX fast path uses a cached `POSIX_NATIVE` module constant. - **`WalkContext.cwdPrefixLen`** precomputed once per walk, used to slice relative paths from absolute paths. - **Cached module constants** `NATIVE_SEP` and `POSIX_NATIVE` avoid per-call `path.sep` property lookups in the hot loop. ## What this PR does NOT change - `WalkEntry` shape: preserved identically (absolutePath, relativePath, size, mtime, isBinary, depth). - `WalkOptions` contract: no new options, no semantics changes. - Symlink handling, `followSymlinks`, cycle detection via `visitedInodes`: unchanged. - IgnoreStack + gitignore semantics: unchanged. - `onDirectoryVisit`, `recordMtimes`, `abortSignal`: unchanged. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `bun test test/lib/scan/walker.test.ts` — 34 pass (incl. property tests covering hidden files, symlinks, maxDepth, gitignore interaction) - [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass (incl. dirMtimes / sourceMtimes cache validation) - [x] DSN count correctness verified end-to-end: 4 DSNs found on fixture (matches pre-change count) - [x] `bun run bench --size large --runs 5 --warmup 2` — results in table above 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor Bot reviewed Apr 20, 2026

View reviewed changes

Comment thread src/lib/dsn/code-scanner.ts

Comment thread src/lib/scan/grep.ts Outdated